A search engine for Arabic documents
Internal identifier: 000B04 (Main/Exploration); previous: 000B03; next: 000B05
Authors: T. Sari [Algeria]; A. Kefali [Algeria]
French descriptors
- Wicri:
- topic: Recherche documentaire (document retrieval).
English descriptors
- mix: Arabic handwriting recognition, Document retrieval, handwriting segmentation.
Abstract
This paper attempts to index and search degraded document images without recognizing the textual patterns, thereby avoiding the cost and laborious effort of OCR technology. The proposed approach deals with text-dominant documents, either handwritten or printed. After the preprocessing and segmentation stages, all the connected components (CCs) of the text are extracted using a bottom-up approach. Each CC is then represented by global indices such as loops, ascenders, etc. Each document is associated with an ASCII file containing the codes of the extracted features. Since no feature extraction technique is reliable enough to locate all the discriminant global indices that model handwriting or degraded print, we apply an approximate string matching technique based on the Levenshtein distance. As a result, the search module can efficiently cope with imprecise and incomplete pattern descriptions. Tests were performed on some Arabic historical documents and showed good performance.
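The matching step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the one-letter feature codes (`L` for loop, `A` for ascender, `D` for descender, `P` for dot) and the function names are assumptions made for illustration, and the search uses the standard Sellers variant of the Levenshtein recurrence, which lets a match start at any position of the document's code string.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def approximate_find(pattern: str, text: str, max_dist: int) -> list:
    """Return (end_index, distance) for each position in text where some
    substring ending there is within max_dist edits of pattern."""
    prev = list(range(len(pattern) + 1))  # column for the empty text prefix
    hits = []
    for end, ct in enumerate(text, start=1):
        curr = [0]  # zero cost: a match may start anywhere in text
        for i, cp in enumerate(pattern, start=1):
            cost = 0 if cp == ct else 1
            curr.append(min(prev[i] + 1,          # skip text symbol ct
                            curr[i - 1] + 1,      # skip pattern symbol cp
                            prev[i - 1] + cost))  # match or substitute
        if curr[-1] <= max_dist:
            hits.append((end, curr[-1]))
        prev = curr
    return hits


# Hypothetical feature codes: L = loop, A = ascender, D = descender, P = dot.
query = "LAD"
document_codes = "PPLADPALDP"
print(approximate_find(query, document_codes, 0))  # → [(5, 0)]
print(approximate_find(query, document_codes, 1))  # also tolerates one missed or spurious feature
```

Raising `max_dist` trades precision for recall: a larger tolerance absorbs more feature extraction errors but also returns more false matches.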
Links toward previous steps (curation, corpus...)
- to stream Hal, to step Corpus: 000014
- to stream Hal, to step Curation: 000014
- to stream Hal, to step Checkpoint: 000120
- to stream Main, to step Merge: 000B15
- to stream Main, to step Curation: 000B04
The document in XML format
<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en">A search engine for Arabic documents</title>
<author><name sortKey="Sari, T" sort="Sari, T" uniqKey="Sari T" first="T." last="Sari">T. Sari</name>
<affiliation wicri:level="1"><hal:affiliation type="laboratory" xml:id="struct-81739" status="VALID"><orgName>Laboratoire de gestion electronique de documents [Annaba]</orgName>
<orgName type="acronym">LabGED</orgName>
<desc><address><country key="DZ"></country>
</address>
</desc>
<listRelation><relation active="#struct-300650" type="direct"></relation>
</listRelation>
<tutelles><tutelle active="#struct-300650" type="direct"><org type="institution" xml:id="struct-300650" status="VALID"><orgName>Université Badji Mokhtar [Annaba]</orgName>
<desc><address><addrLine>BP 12, 23000, Annaba</addrLine>
<country key="DZ"></country>
</address>
<ref type="url">http://www.univ-annaba.dz/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>Algérie</country>
</affiliation>
</author>
<author><name sortKey="Kefali, A" sort="Kefali, A" uniqKey="Kefali A" first="A." last="Kefali">A. Kefali</name>
<affiliation wicri:level="1"><hal:affiliation type="laboratory" xml:id="struct-81739" status="VALID"><orgName>Laboratoire de gestion electronique de documents [Annaba]</orgName>
<orgName type="acronym">LabGED</orgName>
<desc><address><country key="DZ"></country>
</address>
</desc>
<listRelation><relation active="#struct-300650" type="direct"></relation>
</listRelation>
<tutelles><tutelle active="#struct-300650" type="direct"><org type="institution" xml:id="struct-300650" status="VALID"><orgName>Université Badji Mokhtar [Annaba]</orgName>
<desc><address><addrLine>BP 12, 23000, Annaba</addrLine>
<country key="DZ"></country>
</address>
<ref type="url">http://www.univ-annaba.dz/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>Algérie</country>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">HAL</idno>
<idno type="RBID">Hal:hal-00334402</idno>
<idno type="halId">hal-00334402</idno>
<idno type="halUri">https://hal.archives-ouvertes.fr/hal-00334402</idno>
<idno type="url">https://hal.archives-ouvertes.fr/hal-00334402</idno>
<date when="2008-10">2008-10</date>
<idno type="wicri:Area/Hal/Corpus">000014</idno>
<idno type="wicri:Area/Hal/Curation">000014</idno>
<idno type="wicri:Area/Hal/Checkpoint">000120</idno>
<idno type="wicri:Area/Main/Merge">000B15</idno>
<idno type="wicri:Area/Main/Curation">000B04</idno>
<idno type="wicri:Area/Main/Exploration">000B04</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en">A search engine for Arabic documents</title>
<author><name sortKey="Sari, T" sort="Sari, T" uniqKey="Sari T" first="T." last="Sari">T. Sari</name>
<affiliation wicri:level="1"><hal:affiliation type="laboratory" xml:id="struct-81739" status="VALID"><orgName>Laboratoire de gestion electronique de documents [Annaba]</orgName>
<orgName type="acronym">LabGED</orgName>
<desc><address><country key="DZ"></country>
</address>
</desc>
<listRelation><relation active="#struct-300650" type="direct"></relation>
</listRelation>
<tutelles><tutelle active="#struct-300650" type="direct"><org type="institution" xml:id="struct-300650" status="VALID"><orgName>Université Badji Mokhtar [Annaba]</orgName>
<desc><address><addrLine>BP 12, 23000, Annaba</addrLine>
<country key="DZ"></country>
</address>
<ref type="url">http://www.univ-annaba.dz/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>Algérie</country>
</affiliation>
</author>
<author><name sortKey="Kefali, A" sort="Kefali, A" uniqKey="Kefali A" first="A." last="Kefali">A. Kefali</name>
<affiliation wicri:level="1"><hal:affiliation type="laboratory" xml:id="struct-81739" status="VALID"><orgName>Laboratoire de gestion electronique de documents [Annaba]</orgName>
<orgName type="acronym">LabGED</orgName>
<desc><address><country key="DZ"></country>
</address>
</desc>
<listRelation><relation active="#struct-300650" type="direct"></relation>
</listRelation>
<tutelles><tutelle active="#struct-300650" type="direct"><org type="institution" xml:id="struct-300650" status="VALID"><orgName>Université Badji Mokhtar [Annaba]</orgName>
<desc><address><addrLine>BP 12, 23000, Annaba</addrLine>
<country key="DZ"></country>
</address>
<ref type="url">http://www.univ-annaba.dz/</ref>
</desc>
</org>
</tutelle>
</tutelles>
</hal:affiliation>
<country>Algérie</country>
</affiliation>
</author>
</analytic>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass><keywords scheme="mix" xml:lang="en"><term>Arabic handwriting recognition</term>
<term>Document retrieval</term>
<term>handwriting segmentation</term>
</keywords>
<keywords scheme="Wicri" type="topic" xml:lang="fr"><term>Recherche documentaire</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">This paper attempts to index and search degraded document images without recognizing the textual patterns, thereby avoiding the cost and laborious effort of OCR technology. The proposed approach deals with text-dominant documents, either handwritten or printed. After the preprocessing and segmentation stages, all the connected components (CCs) of the text are extracted using a bottom-up approach. Each CC is then represented by global indices such as loops, ascenders, etc. Each document is associated with an ASCII file containing the codes of the extracted features. Since no feature extraction technique is reliable enough to locate all the discriminant global indices that model handwriting or degraded print, we apply an approximate string matching technique based on the Levenshtein distance. As a result, the search module can efficiently cope with imprecise and incomplete pattern descriptions. Tests were performed on some Arabic historical documents and showed good performance.</div>
</front>
</TEI>
<affiliations><list><country><li>Algérie</li>
</country>
</list>
<tree><country name="Algérie"><noRegion><name sortKey="Sari, T" sort="Sari, T" uniqKey="Sari T" first="T." last="Sari">T. Sari</name>
</noRegion>
<name sortKey="Kefali, A" sort="Kefali, A" uniqKey="Kefali A" first="A." last="Kefali">A. Kefali</name>
</country>
</tree>
</affiliations>
</record>
To work with this document under Unix (Dilib)
EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000B04 | SxmlIndent | more
Or
HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000B04 | SxmlIndent | more
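Where the Dilib tools are not available, a record like the one above can also be read with a general-purpose XML parser. The following Python sketch is only an illustration under stated assumptions: the record excerpt is shortened, the function names are hypothetical, and because the record uses the `wicri:` and `hal:` prefixes without declaring them, the sketch wraps it in an element that binds both prefixes to placeholder URIs (not the real namespace URIs).

```python
import xml.etree.ElementTree as ET

# Shortened excerpt of the record; element names follow the page, content is trimmed.
RECORD = (
    '<record><TEI><teiHeader><fileDesc><titleStmt>'
    '<title xml:lang="en">A search engine for Arabic documents</title>'
    '<author><name first="T." last="Sari">T. Sari</name>'
    '<affiliation wicri:level="1"><country>Algérie</country></affiliation></author>'
    '<author><name first="A." last="Kefali">A. Kefali</name>'
    '<affiliation wicri:level="1"><country>Algérie</country></affiliation></author>'
    '</titleStmt></fileDesc></teiHeader></TEI></record>'
)

# Placeholder namespace URIs (assumption): they only make the prefixes parseable.
WRAP_OPEN = '<wrap xmlns:wicri="urn:example:wicri" xmlns:hal="urn:example:hal">'
WRAP_CLOSE = '</wrap>'


def parse_record(xml_text: str) -> ET.Element:
    """Parse a record whose wicri:/hal: prefixes are not declared in the text."""
    root = ET.fromstring(WRAP_OPEN + xml_text + WRAP_CLOSE)
    return root.find("record")


def title_and_authors(record: ET.Element):
    """Extract the title and the author names from the titleStmt element."""
    title = record.findtext(".//titleStmt/title")
    authors = [name.text for name in record.findall(".//titleStmt/author/name")]
    return title, authors


title, authors = title_and_authors(parse_record(RECORD))
print(title)
print(authors)
```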
To link to this page within the Wicri network
{{Explor lien |wiki= Ticri/CIDE |area= OcrV1 |flux= Main |étape= Exploration |type= RBID |clé= Hal:hal-00334402 |texte= A search engine for Arabic documents }}
This area was generated with Dilib version V0.6.32.